- Friday, September 27, 2024
DALDA (Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling) is a framework for enhancing data augmentation in data-scarce scenarios. It pairs a Large Language Model (LLM) with a Diffusion Model (DM) to generate semantically rich images: the LLM embeds novel semantic information into the text prompts, while real images serve as visual prompts, which keeps the generated samples within the target distribution.

Setup follows a conventional path: create and activate a conda environment, install the dependencies from the requirements file, and download the required models and datasets (Flowers102, Oxford Pets, and Caltech101), with commands provided for each step. Generating prompts with the LLM, specifically GPT-4, requires a configuration file containing an Azure endpoint and API key; once the environment is ready, prompts are produced by running the designated script. The framework also includes a classifier training script, with instructions for running it and a resume feature for continuing interrupted training sessions.

DALDA draws on existing code from DA-Fusion and integrates components from IP-Adapter, diffusers, and CLIP in compliance with their respective licenses. The repository is publicly available on GitHub. The project sits within the broader context of data augmentation, synthetic data generation, and the application of diffusion models and large language models in machine learning.
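As a rough illustration of the prompt-generation step described above, the sketch below uses the Azure OpenAI Python SDK to request an enriched caption for one target class. The config file layout, deployment name, and prompt template are assumptions for illustration; DALDA's actual script and config format may differ.

```python
# Hypothetical sketch of GPT-4 prompt generation via Azure OpenAI, as in the setup above.
# The config keys and prompt wording are illustrative, not DALDA's actual format.
import json
from openai import AzureOpenAI

# Assumed config layout: {"azure_endpoint": "...", "api_key": "...", "deployment": "gpt-4"}
with open("config.json") as f:
    cfg = json.load(f)

client = AzureOpenAI(
    azure_endpoint=cfg["azure_endpoint"],
    api_key=cfg["api_key"],
    api_version="2024-02-01",
)

def generate_prompt(class_name: str) -> str:
    """Ask GPT-4 for a semantically enriched caption for one target class."""
    response = client.chat.completions.create(
        model=cfg["deployment"],
        messages=[
            {"role": "system", "content": "You write diverse, detailed image captions."},
            {"role": "user", "content": f"Describe a realistic photo of a {class_name} "
                                        f"in a novel but plausible context."},
        ],
    )
    return response.choices[0].message.content

print(generate_prompt("sunflower"))  # e.g., a Flowers102 class
```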
- Thursday, March 21, 2024
DreamDA offers a new approach to data augmentation, utilizing diffusion models to synthesize diverse, high-quality images that closely match the original data distribution.
- Wednesday, March 20, 2024
Stable Diffusion 3 is a powerful image generation model. This paper introduces Latent Adversarial Diffusion Distillation, which reduces the number of diffusion steps to four while maintaining image generation quality.
- Friday, July 5, 2024
Method from Google to insert semantic objects into images with diffusion. Dataset and demo available.
- Tuesday, April 16, 2024
This post explores how to train diffusion models to generate video, how to adapt image models, and even how to generate video from an image model without additional training.
- Tuesday, June 18, 2024
This paper investigates why diffusion-based image generation models create "hallucinations" — images that never appeared in the training data.
- Wednesday, March 6, 2024
Stable Diffusion 3 uses a novel Multimodal Diffusion Transformer architecture with separate processing weights for text and images. The design improves prompt comprehension and typography, and the model surpasses leading text-to-image systems, promising further advances in AI-generated visual content.
- Monday, September 16, 2024
Google's DataGemma models address the issue of hallucinations in LLMs by grounding them in real-world data from the Data Commons knowledge graph. Two approaches are used: Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG). RIG fine-tunes the model to identify statistics and verify them against Data Commons, while RAG retrieves relevant information before the LLM generates text.
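To make the RAG variant concrete, here is a rough sketch (not Google's implementation): fetch a statistic from Data Commons with its Python client, then pass it to the model as grounded context before generation. The `llm_generate` callable and the example DCIDs are placeholders.

```python
# Illustrative RAG flow: ground an answer in a Data Commons statistic before generation.
# `llm_generate` is a hypothetical stand-in for a DataGemma/Gemma inference call.
import datacommons  # Data Commons Python client

def answer_with_rag(question: str, place_dcid: str, stat_var: str, llm_generate) -> str:
    # Retrieve the real-world statistic first, so the LLM generates from verified data.
    value = datacommons.get_stat_value(place_dcid, stat_var)
    context = f"Data Commons reports {stat_var} = {value} for {place_dcid}."
    prompt = f"{context}\nUsing only the statistic above, answer: {question}"
    return llm_generate(prompt)

# Example call (assumed DCIDs): population of California.
# answer_with_rag("How many people live in California?", "geoId/06", "Count_Person", llm_generate)
```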
- Thursday, August 22, 2024
Amazing new model from Meta that performs both next-token prediction and diffusion on interleaved text and images. It matches the benchmark performance of previous-generation models such as DALL-E 2 and Llama 2 on both text and images.
- Thursday, July 25, 2024
INF-LLaVA is a Multimodal Large Language Model (MLLM) designed to overcome the limitations of processing high-resolution images.
- Wednesday, October 2, 2024
The paper "MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning" introduces a new family of multimodal large language models (MLLMs) aimed at text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building on the earlier MM1 architecture, the work takes a data-centric approach to training: the authors systematically study the effect of diverse data mixtures across the training lifecycle, using high-quality Optical Character Recognition (OCR) data and synthetic captions for continual pre-training and an optimized visual instruction-tuning mixture for supervised fine-tuning.

The models range from 1 billion to 30 billion parameters and include both dense and mixture-of-experts (MoE) variants. The findings suggest that careful data curation and training strategies yield strong performance even at the smaller 1B and 3B scales. Two specialized variants are also introduced: MM1.5-Video for video understanding and MM1.5-UI for mobile user-interface understanding. Extensive empirical studies and ablation experiments document the training decisions behind the final designs, offering guidance for future work on multimodal LLMs and underscoring the importance of data quality and training methodology.
- Friday, March 29, 2024
CoDA is a new approach to Unsupervised Domain Adaptation (UDA). It helps AI models better adapt to unlabeled, challenging environments by learning from differences at both the scene and image levels.
- Wednesday, April 17, 2024
Vision-language models (VLMs) often struggle with processing multiple queries per image and with identifying when objects are absent. This study introduces a new query format to tackle these issues and incorporates semantic segmentation into the training process.
- Friday, May 10, 2024
Predicting more than one token at a time is an active area of research. If successful, it would dramatically improve generation speed for many large language models. The approach in this post, which mirrors consistency models from image synthesis, applies a parallel decoding strategy to fine-tuned LLMs. Early results match the roughly 3x speedups of speculative decoding.
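The sketch below shows Jacobi-style parallel decoding, the fixed-point iteration that underlies consistency-model-style approaches like the one described above; the actual method fine-tunes the LLM so this iteration converges in very few steps. `model_logits` is a hypothetical interface that returns next-token logits for every position in one forward pass.

```python
# Schematic Jacobi-style parallel decoding (illustrative, not the post's exact code).
import torch

def jacobi_decode(model_logits, prompt_ids: torch.Tensor, n_new: int, max_iters: int = 16):
    # Start with an arbitrary guess for all n_new tokens.
    guess = torch.zeros(n_new, dtype=torch.long)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, guess])
        logits = model_logits(seq)                      # one parallel forward pass
        # Greedy prediction for each new position, conditioned on the current guess
        # of all preceding tokens.
        new_guess = logits[len(prompt_ids) - 1 : -1].argmax(dim=-1)
        if torch.equal(new_guess, guess):               # fixed point reached: output
            break                                       # matches greedy autoregressive decoding
        guess = new_guess
    return guess
```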
- Wednesday, June 5, 2024
Fantastic diffusion paper that diffuses code for generating images; edits can be made directly as part of the diffusion process. It is slow, but can be combined easily with search to dramatically improve reasoning ability.
- Friday, June 14, 2024
Stable Diffusion 3 Medium is out. A cutting-edge 2-billion-parameter text-to-image model that generates photorealistic images, it overcomes common artifacts in hands and faces, handles complex prompts, and features enhanced typography. Despite recent legal and financial challenges, Stability AI continues to push the boundaries of generative AI, with future upgrades planned across video, audio, and language.
- Friday, October 4, 2024
ComfyGen introduces prompt-adaptive workflows for text-to-image generation. It reflects a shift in the user community from simple, monolithic models to complex workflows that chain specialized components; such workflows can significantly improve image quality but demand considerable expertise, given the number of available components and their intricate interdependencies. ComfyGen's core contribution is automating workflow generation for a specific user prompt via two large language model (LLM) baselines: a tuning-based method that learns from user-preference data and a training-free method that uses the LLM to select from existing workflows. Both improve image quality over monolithic models and over generic, prompt-independent workflows.

The implementation is built around ComfyUI, an open-source tool for creating and executing text-to-image pipelines, whose JSON workflow format is convenient for LLM prediction. To build training data, a collection of human-created ComfyUI workflows is augmented by randomly varying the base model, LoRAs, samplers, and other settings; 500 prompts are then rendered with each workflow, and the resulting images are scored for aesthetic appeal and human preference, yielding a dataset of (prompt, flow, score) triplets.

Two prediction strategies are explored. In the in-context approach, the LLM is given a table of workflows and their scores and selects the most suitable one for a new prompt. In the fine-tuning approach, the LLM is trained on input prompts and scores to predict the workflow that yields the highest-quality result. Comparative evaluations show that ComfyGen outperforms both monolithic models and fixed, prompt-independent workflows on human-preference and prompt-alignment benchmarks, with user studies and established benchmarks such as GenEval further validating the approach and pointing to automated, prompt-tailored workflows as a practical way to improve image quality and user experience.
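A schematic sketch of the in-context selection approach described above: show the LLM a small table of pre-scored ComfyUI workflows and ask it to pick one for the new prompt. The workflow files, score field, and `llm_complete` callable are hypothetical placeholders, not the paper's actual data format or API.

```python
# Hypothetical sketch of ComfyGen-style in-context workflow selection.
import json
from pathlib import Path

def select_workflow(prompt: str, workflow_dir: str, llm_complete) -> dict:
    # Load pre-scored candidate workflows (ComfyUI stores workflows as JSON).
    candidates = []
    for path in Path(workflow_dir).glob("*.json"):
        flow = json.loads(path.read_text())
        candidates.append((path.stem, flow.get("score", 0.0), flow))

    # Present a (name, score) table and ask the LLM to choose for this prompt.
    table = "\n".join(f"{name}: score={score:.3f}" for name, score, _ in candidates)
    instruction = (
        "Given the scored ComfyUI workflows below, reply with the name of the one "
        f"best suited to this prompt.\n\nPrompt: {prompt}\n\nWorkflows:\n{table}"
    )
    choice = llm_complete(instruction).strip()

    # Fall back to the highest-scoring workflow if the LLM's answer is unrecognized.
    by_name = {name: flow for name, _, flow in candidates}
    return by_name.get(choice, max(candidates, key=lambda c: c[1])[2])
```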
- Wednesday, October 2, 2024
NVIDIA has introduced NVLM 1.0, a series of advanced multimodal large language models (LLMs) that excel at vision-language tasks, competing with proprietary models such as GPT-4o and open-access models such as Llama 3-V 405B and InternVL 2. The NVLM-D-72B model from this release, a decoder-only architecture, has been open-sourced for community use. Notably, NVLM 1.0 improves on its underlying LLM's text-only performance after multimodal training.

The models were trained with the Megatron-LM framework and adapted for hosting and inference on Hugging Face, enabling reproducibility and comparison with other models. Benchmark results show NVLM-D 1.0 72B achieving impressive scores on vision-language benchmarks such as MMMU, MathVista, and VQAv2, competitive with other leading models, while also performing well on text-only benchmarks. The architecture supports efficient loading and inference, including multi-GPU setups, and the documentation covers preparing the environment, loading the model, and running inference. Users can hold pure-text conversations or ask the model to describe images, and detailed code snippets show how to load and preprocess images and interact with the model.

NVLM is a collaborative effort by researchers at NVIDIA and is released under the Creative Commons BY-NC 4.0 license for non-commercial use. Its introduction marks a significant advance in multimodal AI, providing powerful tools for developers and researchers alike.
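As a rough sketch of the Hugging Face inference path mentioned above, the snippet below loads the model for a pure-text turn. The repository id, dtype, and chat-style helper are assumptions drawn from the general pattern of such releases; the model card's actual code and signatures may differ, and image inputs require an extra preprocessing step not shown here.

```python
# Hedged sketch of loading NVLM-D-72B from Hugging Face for a text-only exchange.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "nvidia/NVLM-D-72B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # spread the 72B weights across available GPUs
    trust_remote_code=True,     # the release ships custom modeling code
).eval()

# Assumed chat-style helper exposed by the custom modeling code; passing None
# for the image argument corresponds to a pure-text conversation.
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
print(response)
```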
- Tuesday, March 12, 2024
The novel Stealing Stable Diffusion (SSD) approach boosts the accuracy of monocular depth estimation in difficult environments like low-light or rainy conditions.
- Friday, August 2, 2024
The creators of VQGAN, Latent Diffusion, and Stable Diffusion have raised more than $30 million and started a new company. They have released new flagship image generation models that are extremely capable and come in a variety of tiers.
- Monday, September 30, 2024
The paper "Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs", by Qinpeng Cui and eight co-authors, presents a novel diffusion-based approach to image super-resolution (SR). Diffusion-based SR models are popular for their strong image restoration capabilities, but existing models either fail to leverage the full potential of pre-trained models, limiting their generative abilities, or require numerous forward passes starting from random noise, making inference inefficient.

The proposed DoSSR (Domain Shift diffusion-based SR) model instead initiates the diffusion process from the low-resolution image, capitalizing on the generative strengths of pre-trained diffusion models. Central to the method is a domain shift equation that integrates smoothly with existing diffusion models, improving both the use of the diffusion prior and inference efficiency. The authors then move from a discrete shift process to a continuous formulation, DoS-SDEs, which enables fast, customized solvers for efficient sampling. Empirically, DoSSR achieves state-of-the-art performance on synthetic and real-world datasets with only five sampling steps, a 5-7x speedup over previous diffusion-prior-based methods. The paper has been accepted for presentation at NeurIPS 2024.
- Tuesday, June 25, 2024
Despite GPT-4o's advanced imaging capabilities, OpenAI is still actively enhancing DALL-E 3, focusing on refining text rendering and visual accuracy. With stiff competition from Midjourney and Ideogram, OpenAI's strategy underscores the continuous evolution of, and challenges in, AI-driven visual technologies.
- Wednesday, June 26, 2024
Together AI and Morph Labs have put together a great blog post on tuning models for retrieval augmented generation. They showcase some uses of synthetic data as well.
- Friday, March 22, 2024
Diffusion State Space Models (DiS) are a new type of diffusion model that use a state space backbone instead of the traditional U-Net for image data. These models can handle long-range dependencies and are efficient in generating high-quality images with less computational effort.
- Thursday, April 4, 2024
OpenAI's DALL-E now offers image editing tools both on the web and on mobile. There are preset style suggestions to help inspire image creation. The image generation platform has been integrated with ChatGPT - users can now edit DALL-E images in ChatGPT across web, iOS, and Android. Videos from OpenAI showing off the new features are available in the article.
- Thursday, September 26, 2024
Llama 3.2 has been introduced as a significant advance in edge AI and vision technology, featuring a range of open, customizable models. The release includes small and medium vision large language models (LLMs) with 11 billion and 90 billion parameters, as well as lightweight text-only models with 1 billion and 3 billion parameters. The models are optimized for deployment on edge and mobile devices, support a context length of 128,000 tokens, and suit tasks such as summarization, instruction following, and rewriting.

The vision models target image-understanding tasks such as document-level comprehension, image captioning, and visual grounding, and accept both text and image inputs for complex reasoning over visual data: users can, for instance, query the model about sales data in a graph or ask for navigational help based on a map. Architecturally, they add new adapter weights that integrate image processing into the existing language model, preserving text-only capabilities while adding visual reasoning. The lightweight models focus on multilingual text generation and tool calling, enabling privacy-focused applications that run entirely on-device.

Llama 3.2 is backed by a robust ecosystem: partnerships with major technology companies such as AWS, Databricks, and Qualcomm make the models easy to integrate, and the Llama Stack provides tooling to simplify development across on-premises, cloud, and mobile environments. Extensive evaluations show competitive performance against leading foundation models on both image recognition and language tasks. On the safety side, new measures such as Llama Guard filter inappropriate content, and the lightweight versions have been optimized for efficiency in constrained environments. The models are available for download and immediate development, reflecting Meta's continued emphasis on openness, responsible AI practices, and engagement with partners and the open-source community.
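To make the image-grounded querying described above concrete, here is a minimal sketch using the Hugging Face transformers Mllama classes with the 11B instruct checkpoint. It assumes a recent transformers release, approved access to the gated Meta checkpoint, and a placeholder image URL; details may differ from Meta's reference code.

```python
# Hedged sketch: asking the Llama 3.2 11B vision model about a chart image.
# Assumes transformers >= 4.45 and access to the gated Meta checkpoint.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Any chart or graph image; the URL here is a placeholder.
image = Image.open(requests.get("https://example.com/sales_chart.png", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Which month had the highest sales in this chart?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```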
- Thursday, April 4, 2024
DALL-E images can now be modified using a new editor interface from OpenAI that lets users describe changes using text prompts. Users can use the new select button to give specific instructions for a particular part of an image. Alternatively, users can make general changes to the image by entering a prompt in the chat sidebar.
- Wednesday, April 24, 2024
SEED-X advances multimodal foundation models by tackling real-world application challenges. It can understand images of any size and aspect ratio and produce images with varying levels of detail.
- Friday, April 19, 2024
The launch of OpenAI's DALL-E 2 in April 2022 marked a groundbreaking and tumultuous period in AI history, as a tight-knit group of artists and tech enthusiasts used the technology to explore the intersection of language and visual art. The amazement and exhilaration soon gave way to concerns about the ethics of training AI models on copyrighted creative work without permission or compensation, a polarizing debate that continues to reverberate in the AI space as OpenAI moves on to DALL-E 3 and other AI image synthesis models emerge.
- Thursday, April 18, 2024
Stability AI has made its latest text-to-image AI model, Stable Diffusion 3, available to some developers via API and its new content creation platform called Stable Assistant Beta. The model is still in preview and not yet available to the general public.